Multi-lingual Indexing Support for CLIR using Language Modeling
Authors
Abstract
An indexing model is the heart of an Information Retrieval (IR) system. Data structures such as term-based inverted indices have proved very effective for IR with vector space retrieval models. However, when the functional aspects of such models were tested, it became clear that better relevance models were needed to compute the relevance of a document to a query more accurately. Language modeling approaches [1] were shown to improve the quality of search results in monolingual IR tasks compared with the TF-IDF [2] algorithm. The disadvantage of language modeling approaches in a monolingual IR task, as suggested in [1], is that they require both the inverted index (term-to-document) and the forward index (document-to-term) to compute the rank of a document for a given query. This incurs additional space and computation overhead compared with inverted index models. Such a cost may be acceptable if the quality of search results is significantly improved. For a Cross-lingual IR (CLIR) task, we have previously shown in [3] that using a bilingual dictionary together with term co-occurrence statistics and a language modeling approach helps improve functional IR performance. However, no studies exist on the performance overhead that language modeling introduces in a CLIR task. In this paper we present an augmented index model that can be used for fast retrieval while retaining the benefits of language modeling in a CLIR task. The model supports retrieval and ranking with or without query expansion techniques that use term collocation statistics of the indexed corpus. Finally, we conduct performance experiments on our indexing model to determine its overhead in space and time.
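To illustrate why a language modeling ranker touches both index types, the following is a minimal sketch, not the augmented index model of the paper. It assumes a query-likelihood score with Jelinek-Mercer smoothing, which may differ from the formulation in [1]; the function names and the smoothing weight `lam` are hypothetical. The inverted index supplies per-term postings and collection frequencies, while the forward index supplies each document's length.

```python
import math
from collections import Counter, defaultdict


def build_indexes(docs):
    """Build both indexes from docs, a dict mapping doc_id -> list of tokens."""
    inverted = defaultdict(dict)  # term -> {doc_id: term frequency}
    forward = {}                  # doc_id -> Counter of term frequencies
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        forward[doc_id] = counts
        for term, tf in counts.items():
            inverted[term][doc_id] = tf
    return inverted, forward


def score(query, doc_id, inverted, forward, lam=0.5):
    """Log query likelihood of one document (assumed Jelinek-Mercer smoothing)."""
    doc_len = sum(forward[doc_id].values())                     # forward index
    coll_len = sum(sum(c.values()) for c in forward.values())
    log_p = 0.0
    for term in query:
        postings = inverted.get(term, {})                       # inverted index
        coll_freq = sum(postings.values())
        if coll_freq == 0:
            continue  # term never occurs in the collection; skip it
        p_doc = postings.get(doc_id, 0) / doc_len if doc_len else 0.0
        p_coll = coll_freq / coll_len
        log_p += math.log(lam * p_doc + (1 - lam) * p_coll)
    return log_p


def rank(query, inverted, forward, lam=0.5):
    """Return all doc ids ordered from most to least likely for the query."""
    return sorted(
        forward,
        key=lambda d: score(query, d, inverted, forward, lam),
        reverse=True,
    )


docs = {
    "d1": ["cross", "lingual", "retrieval"],
    "d2": ["language", "model", "retrieval", "retrieval"],
    "d3": ["inverted", "index"],
}
inverted, forward = build_indexes(docs)
ranking = rank(["language", "retrieval"], inverted, forward)
```

In this sketch the collection length is recomputed on every call for brevity; a real index would store it, along with document lengths, which is one source of the extra space the abstract refers to.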
Similar resources
Experiments in Cross Language Query Focused Multi-Document Summarization
The twin challenges of massive information overload via the web and ubiquitous computers present us with an unavoidable task: developing techniques to handle multilingual information robustly and efficiently, with as high quality performance as possible. Previous research activities on multilingual information access systems have studied cross-language information retrieval (CLIR), information ...
Cross-lingual Information Retrieval Model based on Bilingual Topic Correlation
Constructing relationships between bilingual texts is important for effectively processing multilingual text data and crossing language barriers. Corpus-based cross-lingual latent semantic indexing (CL-LSI) does not fully take into account bilingual semantic relationships. The paper proposes a new model that builds semantic relationships between bilingual parallel documents via partial least squares (PLS)....
AINLP at NTCIR-6: Evaluations for Multilingual and Cross-Lingual Information Retrieval
In this paper, a multilingual cross-lingual information retrieval (CLIR) system is presented and evaluated in the NTCIR-6 project. We use language-independent indexing technology to process the text collections in Chinese, Japanese, Korean, and English. Different machine translation systems are used to translate the queries for bilingual and multilingual CLIR. The experimental results...
Melange: Components for Cross-Lingual Retrieval
We present the finalized version of our cross-lingual search engine Melange, and results obtained by running it on WebCLEF topics in an attempt to solve Mixed Monolingual and Multilingual tasks. We concentrate on certain features of the system which are relevant to the CLIR field and which can be developed further independently. These are our data extraction and indexing methods, our language d...
Overview of CLIR Task at the Fifth NTCIR Workshop
The purpose of this paper is to overview research efforts at the NTCIR-5 CLIR task, which is a project of large-scale retrieval experiments on cross-lingual information retrieval (CLIR) of Chinese, Japanese, Korean, and English. The project has three sub-tasks, multi-lingual IR (MLIR), bilingual IR (BLIR), and single language IR (SLIR), in which many research groups from over ten countries are ...
Journal title:
- IEEE Data Eng. Bull.
Volume 30, Issue
Pages -
Publication date 2007